The enormous increase in the popularity and use of the WWW has led in recent years to important
changes in the ways people communicate. An interesting example of this fact is provided by
the now very popular social annotation systems, through which users annotate resources (such as 
web pages or digital photographs) with text keywords dubbed tags. Understanding the rich emerg- 
ing structures resulting from the uncoordinated actions of users calls for an interdisciplinary effort. 
In particular concepts borrowed from statistical physics, such as random walks, and the complex 
networks framework, can effectively contribute to the mathematical modeling of social annotation 
systems. Here we show that the process of social annotation can be seen as a collective but un- 
coordinated exploration of an underlying semantic space, pictured as a graph, through a series of 
random walks. This modeling framework reproduces several aspects, so far unexplained, of social 
annotation, among which the peculiar growth of the size of the vocabulary used by the commu- 
nity and its complex network structure that represents an externalization of semantic structures 
grounded in cognition and typically hard to access. 

I. INTRODUCTION



The rise of Web 2.0 has dramatically changed the way in which information is stored and accessed, and the 
relationship between information and on-line users. This is prompting the need for a new research agenda about
"Web Science", as put forward in [1]. A central role is played by user-driven information networks, i.e., networks
of on-line resources built in a bottom-up fashion by Web users. These networks entangle cognitive, behavioral and
social aspects of human agents with the structure of the underlying technological system, effectively creating techno-social
systems that display rich emergent features and emergent semantics [2, 3]. Understanding their structure and
evolution brings forth new challenges. 

Many popular Web applications are now exploiting user-driven information networks built by means of social annotations
[4, 5]. Social annotations are freely established associations between Web resources and metadata (keywords,
categories, ratings) performed by a community of Web users with little or no central coordination. A mechanism of 
this kind which has swiftly become well-established is that of collaborative tagging [6], whereby Web users associate
free-form keywords - called "tags" - with on-line content such as Web pages, digital photographs, bibliographic refer- 
ences and other media. The product of the users' tagging activity is an open-ended information network - commonly 
referred to as "folksonomy" - which can be used for navigation and recommendation of content, and has been the 
object of many recent investigations across different disciplines [7, 8]. Here we show how simple concepts borrowed
from statistical physics and the study of complex networks can provide a modeling framework for the dynamics of 
collaborative tagging and the structure of the ensuing folksonomy. 

Two main aspects of the social annotation process, so far unexplained, deserve special attention. One striking
feature is the so-called Heaps' law (also known as Herdan's law in linguistics), originally studied in Information 
Retrieval for its relevance for indexing schemes [9]. Heaps' law is an empirical law which describes the growth in a
text of the number of distinct words as a function of the number of total words scanned. It describes thus the rate of 
innovation in a stream of words, where innovation means the adoption for the first time in the text of a given word. 
This law, also experimentally observed in streams of tags, consists of a power-law with a sub-linear behavior 0, In 
this case the rate of innovation is the rate of introduction of new tags, and a sub-linear behavior corresponds to a rate 
of adoption of new words or tags decreasing with the total number of words (or tags) scanned. Most existing studies 
about Heaps' law, either in Information Retrieval or in linguistics, explained it as a consequence of the so-called Zipf 's 
law prl . Il3| . It would instead be highly desirable to have an explanation for it relying only on very basic assumptions 
on the mechanisms behind social annotation. 
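
As a minimal illustration of the quantity that Heaps' law describes, the following Python sketch counts the number of distinct items seen as a stream of words or tags is scanned. The function name and the toy tag stream are hypothetical and serve only to make the definition concrete; they are not taken from the datasets analyzed here.

```python
def vocabulary_growth(stream):
    """Return the vocabulary size after each scanned item.

    The n-th entry of the result is the number of distinct items
    among the first n items of the stream (n counted from 1).
    """
    seen = set()
    growth = []
    for item in stream:
        seen.add(item)          # an "innovation" only if the item is new
        growth.append(len(seen))
    return growth


# Hypothetical toy stream of tags, in order of appearance.
tags = ["web", "design", "web", "css", "design", "web", "blog"]
print(vocabulary_growth(tags))
```

Plotting the returned list against the item index on doubly logarithmic axes is one way to check for a sub-linear power-law growth.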

Another important way to analyze the emerging data structures is given by the framework of complex networks [3, 
These structures are indeed user-driven information networks i.e., networks linking (for instance) on-line 
resources, tags and users, built in a bottom-up fashion through the uncoordinated activity of thousands to millions
of Web users. We shall focus in particular on the structure of the so-called co-occurrence network. The co-occurrence
network is a weighted network where nodes are tags and two tags are linked if they were used together by at
least one user, the weight being larger when this simultaneous use is shared by many users. Correlations between tag 
occurrences are (at least partially) an externalization of the relations between the corresponding meanings p^ . IToj and 
have been used to infer formal representations of knowledge from social annotations [20]. Notice that co-occurrence
of two tags is not a priori equivalent to a semantic link between the meanings/concepts associated with those tags, 
and that understanding what co-occurrence precisely means, in terms of semantic relations of the co-occurring tags, 
is an open question that is investigated in more applied contexts [21, 22].

On these aspects of social annotation systems, a certain number of stylized facts, about e.g. tag frequencies 0, [§| 
or the growth of the tag vocabulary [T^ , have been reported but no modeling framework exists which can naturally 
account for them while reproducing the co-occurrence network structure. Here we ask whether one is able to explain 
the structure of such a network in terms of some suitable generative model and how the structure of the experimentally 
observed co-occurrence network is related to the underlying hypotheses of the modeling scheme. We show in particular 
that the idea of social exploration of a semantic space has more than a metaphorical value, and actually allows us to 
reproduce simultaneously a set of independent correlations and fine observables of tag co-occurrence networks as well 
as robust stylized facts of collaborative tagging systems. 



II. USER-DRIVEN INFORMATION NETWORKS 

We investigate user-driven information networks using data from two social bookmarking systems: del.icio.us
and BibSonomy. Del.icio.us is a very popular system for bookmarking web pages and pioneered the mechanisms
of collaborative tagging. It hosts a large body of social annotations that have been used for several scientific in- 
vestigations. BibSonomy is a smaller system for bookmarking bibliographic references and web pages [1^. Both 
del.icio.us and BibSonomy are broad folksonomies (2^ . in which users provide metadata about pre-existing resources 
and multiple annotations are possible for the same resource, making the ensuing tagging patterns truly "social" and 
allowing their statistical characterization. 

A single user annotation, also known as a post, is a triple of the form $(u, r, T)$, where $u$ is a user identifier, $r$ is the
unique identifier of a resource (a URL pointing to a web page, for the systems under study), and $T = \{t_1, t_2, \ldots\}$
is a set of tags represented as text strings. We define the tag co-occurrence network based on post co-occurrence.
That is, given a set of posts, we create an undirected and weighted network where nodes are tags and two tags $t_1$
and $t_2$ are connected by an edge if and only if there exists at least one post in which they were used in conjunction. The
weight $w_{t_1 t_2}$ of an edge between tags $t_1$ and $t_2$ can be naturally defined as the number of distinct posts where $t_1$ and
$t_2$ co-occur. This construction reflects the existence of semantic correlations between tags, and translates the fact
that these correlations are stronger between tags co-occurring more frequently. We emphasize once again that the 
co-occurrence network is an externalization of hidden semantic links, and therefore distinct from underlying semantic 
lexicons or networks. 
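
The construction above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name, the representation of a post as a `(user, resource, tags)` tuple and the toy posts are assumptions, not the code used for the analysis.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(posts):
    """Build the weighted tag co-occurrence network from posts.

    posts: iterable of (user, resource, tags) triples, tags being a
    collection of strings. Returns a Counter mapping each unordered
    tag pair (stored as a sorted tuple) to the number of posts in
    which the two tags co-occur.
    """
    weights = Counter()
    for _user, _resource, tags in posts:
        # set() guards against a tag repeated inside one post
        for t1, t2 in combinations(sorted(set(tags)), 2):
            weights[(t1, t2)] += 1
    return weights


# Hypothetical toy posts.
posts = [
    ("u1", "r1", {"python", "web"}),
    ("u2", "r1", {"python", "web", "tutorial"}),
    ("u1", "r2", {"python"}),
]
network = cooccurrence_network(posts)
print(network[("python", "web")])
```

Storing each edge under a sorted tuple keeps the network undirected without duplicating entries.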



A. Data from del.icio.us 



The del.icio.us dataset we used consists of approximately $5 \cdot 10^7$ posts, comprising about 650 000 users, $1.9 \cdot 10^7$
resources (bookmarks) and $2.5 \cdot 10^6$ distinct tags. It covers almost 3 years of user activity, from early 2004 up to
November 2006. Overall, 667 128 user pages of the del.icio.us community were crawled, for a total of 18 782 132
resources, 2 454 546 distinct tags, and 140 333 714 tag assignments (triples).

The data were subsequently post-processed for the present study. We discarded all posts containing no tags (about 
7% of the total). As del.icio.us is case-preserving but not case sensitive, we ignored capitalization in tag comparison, 
and counted all different capitalizations of a given tag as instances of the same lower-case tag. The timestamp of 
each post was used to establish post ordering and determine the temporal evolution of the system. Posts with invalid 
timestamps, i.e. times set in the future or before del.icio.us started operating, were discarded as well (less than 0.5% 
of the total). 

Except for the normalization of character case, no lexical normalization was applied to tags during post-processing.
Two tags are thus considered identical if and only if their string representations are identical.
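
The post-processing steps described above can be summarized by the following Python sketch. It is a hedged illustration only: the function name, the tuple layout of a post and the two date bounds (`earliest`, `latest`) are hypothetical choices, since the exact bounds used for the crawl are not specified here.

```python
from datetime import datetime


def clean_posts(posts, earliest, latest):
    """Apply the three post-processing rules to raw posts.

    posts: iterable of (user, resource, tags, timestamp) tuples.
    Posts with a timestamp outside [earliest, latest] or without
    tags are discarded, tags are lower-cased, and the surviving
    posts are returned in chronological order.
    """
    cleaned = []
    for user, resource, tags, timestamp in posts:
        if not (earliest <= timestamp <= latest):
            continue  # invalid timestamp
        normalized = {tag.lower() for tag in tags}
        if not normalized:
            continue  # post without tags
        cleaned.append((user, resource, normalized, timestamp))
    cleaned.sort(key=lambda post: post[3])
    return cleaned


# Hypothetical raw posts and date bounds.
raw = [
    ("u1", "http://example.org/a", ["Python", "python", "Web"], datetime(2005, 3, 1)),
    ("u2", "http://example.org/b", [], datetime(2005, 4, 1)),
    ("u3", "http://example.org/c", ["css"], datetime(2031, 1, 1)),
    ("u4", "http://example.org/d", ["Design"], datetime(2004, 6, 1)),
]
cleaned = clean_posts(raw, earliest=datetime(2003, 9, 1), latest=datetime(2006, 11, 30))
print([post[0] for post in cleaned])
```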

B. Data from BibSonomy

BibSonomy [23] is a smaller system than del.icio.us, but it was designed with data sharing in mind. Because
of this, there is no need to crawl BibSonomy by downloading HTML pages and parsing them. Direct access to post 
data in structured form is available by using the BibSonomy API (http://www.bibsonomy.org/help/doc/api.html). 
Moreover, the BibSonomy team periodically releases snapshot datasets of the full system and makes them available to 
the research community. For the present work we used the dataset released in January 2008
(https://www.kde.cs.uni-kassel.de/bibsonomy/dumps/2007-12-31.tgz).

BibSonomy allows two different types of resources: bookmarks (i.e., URLs of web pages, similar to del.icio.us) and 
BibTeX entries. To make contact with the analysis done for del.icio.us, we restricted the dataset to the posts involving 
bookmark resources only. The resulting dataset we used comprises 1 400 users, 127 115 resources, 37 966 distinct tags,
and 503 928 tag assignments (triples). The data from BibSonomy was post-processed in the same way as the data 
from del.icio.us. 

While the BibSonomy dataset is much smaller than the del.icio.us dataset, it is a precious one: direct access to 
BibSonomy's database guarantees that the BibSonomy dataset is free from biases due to the data collection procedure. 
This is important because it allows us to show that the investigated features of the data are robust across different 
systems, and not only established in a case where biases due to data collection could be possible. 

C. Data analysis 

The study of the global properties of the tagging system, and in particular of the global co-occurrence network, is of 
interest but mixes potentially many different phenomena. We therefore consider a narrower semantic context, defined 
as the set of posts containing one given tag. We define the vocabulary associated with a given tag $t^*$ as the set of all
tags occurring in a post together with $t^*$, and the time is counted as the number of posts in which $t^*$ has appeared.
The size of the vocabulary follows a sub-linear power-law growth (Fig. 1), similar to the Heaps' law [l^l observed for
the vocabulary associated with a given resource, and for the global vocabulary [13]. Figure 1 also displays the main
properties of the co-occurrence network, as measured by the quantities customarily used to characterize statistically 
complex networks and to validate models p^.[l6j. These quantities can be separated in two groups. On the one hand, 
they include the distributions of single node or single link quantities, whose investigation makes it possible to distinguish between
homogeneous and heterogeneous systems. Figure 1 shows that the co-occurrence networks display broad distributions
of node degrees $k_t$ (number of neighbors of node $t$), node strengths $s_t$ (sum of the weights of the links connected to
$t$, $s_t = \sum_{t'} w_{tt'}$), and link weights. The average strength $s(k)$ of vertices with degree $k$, $s(k) = \frac{1}{N_k} \sum_{t : k_t = k} s_t$, where
$N_k$ is the number of nodes of degree $k$, also shows that correlations between topological information and weights are
present. On the other hand, these distributions by themselves are not sufficient to fully characterize a network and
higher order correlations have to be investigated. In particular, the average nearest neighbors degree of a vertex $t$,
$k_{nn,t} = \frac{1}{k_t} \sum_{t' \in V(t)} k_{t'}$, where $V(t)$ is the set of $t$'s neighbors, gives information on correlations between the degrees of
neighboring nodes. Moreover, the clustering coefficient $C_t = e_t / (k_t (k_t - 1)/2)$ of a node $t$ measures local cohesiveness
through the ratio between the number $e_t$ of links between the $k_t$ neighbors of $t$ and the maximum number of such
links 23]. The functions $k_{nn}(k) = \frac{1}{N_k} \sum_{t : k_t = k} k_{nn,t}$ and $C(k) = \frac{1}{N_k} \sum_{t : k_t = k} C_t$ are convenient summaries of these
quantities, which can also be generalized to include weights (see SI for the definitions of $k^w_{nn}(k)$ and $C^w(k)$). Figure 1
shows that broad distributions and non-trivial correlations are observed. All the measured features are robust across
tags within one tagging system, and also across the tagging systems we investigated.
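
The unweighted quantities defined above can be computed as in the following Python sketch. It assumes, purely for illustration, that the network is stored as a dictionary mapping tag pairs to weights; the function names are hypothetical, and setting $C_t = 0$ for nodes of degree smaller than 2 is a convention chosen here, not one stated in the text.

```python
from collections import defaultdict


def node_measures(weights):
    """Degree k_t, strength s_t, k_nn,t and clustering C_t per node.

    weights: dict mapping an unordered pair (t1, t2) to its weight.
    """
    neighbors = defaultdict(set)
    strength = defaultdict(float)
    for (a, b), w in weights.items():
        neighbors[a].add(b)
        neighbors[b].add(a)
        strength[a] += w
        strength[b] += w
    degree = {t: len(nb) for t, nb in neighbors.items()}
    knn = {t: sum(degree[v] for v in nb) / degree[t]
           for t, nb in neighbors.items()}
    clustering = {}
    for t, nb in neighbors.items():
        k = degree[t]
        if k < 2:
            clustering[t] = 0.0  # convention assumed for this sketch
            continue
        # e_t: links among the neighbours of t (each one is seen twice)
        e_t = sum(len(neighbors[v] & nb) for v in nb) / 2
        clustering[t] = e_t / (k * (k - 1) / 2)
    return degree, dict(strength), knn, clustering


def average_by_degree(degree, values):
    """Average a node quantity over all nodes with the same degree k."""
    sums, counts = defaultdict(float), defaultdict(int)
    for t, k in degree.items():
        sums[k] += values[t]
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in sums}


# Hypothetical toy network: a triangle a-b-c plus a pendant node d.
toy = {("a", "b"): 2, ("a", "c"): 1, ("b", "c"): 1, ("a", "d"): 3}
degree, strength, knn, clustering = node_measures(toy)
print(average_by_degree(degree, strength))   # s(k)
print(average_by_degree(degree, knn))        # k_nn(k)
print(average_by_degree(degree, clustering)) # C(k)
```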

III. MODELING SOCIAL ANNOTATION 

The observed features are emergent characteristics of the uncoordinated action of a user community, which call for 
a rationalization and for a modeling framework. We now present a simple mechanism able to reproduce the complex 
evolution and structure of the empirical data. 

The fundamental idea underlying our approach, illustrated in Fig. 2, is that a post corresponds to a random walk
(RW) of the user in a "semantic space" modeled as a graph. Starting from a given tag, the user adds other tags, going 
from one tag to another by semantic association. It is then natural to picture the semantic space as network-like, 
with nodes representing tags and links representing the possibility of a semantic link [2^. A precise and complete 
description of such a semantic network being out of reach, we make very general hypotheses about its structure and
we have checked the robustness of our results with respect to different plausible choices of the graph structure [1^. 
Nevertheless, as we shall see later on, our results help to fix some constraints on the structural properties of such
a semantic space: it should have a finite average degree together with a small graph diameter, which ensures that
RWs starting from a fixed node and of limited length can potentially reach all nodes of the graph. In this framework,
the vocabulary co-occurring with a tag is associated with the ensemble of nodes reached by successive random walks
starting from a given node, and its size with the number of distinct visited nodes, $N_{distinct}$, which grows as a function
of the number of performed random walks $n_{RW}$.

A. Fixed length random walks 

Let us first consider random walks of fixed length $l$ starting from a given node $i_0$. We denote by $p_i$ the probability
for each of these random walks to visit node $i$. The probability that $i$ has not been visited after $n_{RW}$ random walks
is then simply

$\mathrm{Proba}(i \text{ not visited}) = (1 - p_i)^{n_{RW}}$ , (1)

since the random walks are independent stochastic processes, and the probability that $i$ has been visited at least once
reads

$\mathrm{Proba}(i \text{ visited}) = 1 - \mathrm{Proba}(i \text{ not visited}) = 1 - (1 - p_i)^{n_{RW}}$ . (2)

The average number of distinct nodes visited after $n_{RW}$ random walks is then given, without any assumption on the
network's structure, by

$N_{distinct} = \sum_i \left( 1 - (1 - p_i)^{n_{RW}} \right)$ , (3)

where the sum runs over all nodes of the network.
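
Equation (3) is straightforward to evaluate once the visit probabilities are known. The Python sketch below does so for a hypothetical list of probabilities; both the function name and the numerical values are illustrative assumptions.

```python
def expected_distinct(visit_probs, n_rw):
    """Eq. (3): expected number of distinct nodes visited by n_rw walks.

    visit_probs[i] is the probability p_i that a single walk visits node i.
    """
    return sum(1.0 - (1.0 - p_i) ** n_rw for p_i in visit_probs)


# Hypothetical visit probabilities: the origin is always visited,
# two nodes are visited by half of the walks, four nodes rarely.
visit_probs = [1.0, 0.5, 0.5, 0.1, 0.1, 0.1, 0.1]
for n_rw in (1, 10, 100):
    print(n_rw, round(expected_distinct(visit_probs, n_rw), 3))
```

For a single walk the result reduces to the sum of the $p_i$, and for many walks it approaches the number of nodes with non-zero visit probability.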

While this exact expression is not yet really informative, it is possible to go further under some simple assumptions
(we also note that analytical results are available in the case of random walks performed either on lattices or on
fractal substrates [27]). Since all the random walks start from the same origin $i_0$, it is useful to divide the network
into successive "rings" [l^, each ring of label $l$ being formed by the nodes at distance $l$ from $i_0$. The ring $l = 1$ is
formed by the neighbours of $i_0$, the ring $l = 2$ by the neighbours' neighbours which are not part of ring 1, and so forth.
We denote by $N_l$ the number of nodes in ring $l$. We now make the assumption that all $N_l$ nodes at distance $l$ have
the same probability to be reached by a random walk starting from $i_0$ (which is the sole element of ring 0). This is
rigorously true for example for a tree with constant coordination number, and more generally will hold approximately
in homogeneous networks, while stronger deviations are expected in heterogeneous networks. Let us assume moreover
that the random walk of length $l_{max}$ consists, at each step, of moving from one ring $l$ to the next ring $l + 1$. This is
once again rigorously true for a self-avoiding random walk on a tree, and can be expected to hold approximately if $N_l$
grows fast enough with $l$: the probability to go from ring $l$ to ring $l + 1$ is then larger than to go back to ring $l - 1$ or
to stay within ring $l$. For each random walk of length $l_{max}$, we then have $p_i = 1/N_l$ for each node $i$ in ring $l \le l_{max}$,
and after $n_{RW}$ walks, the average number of distinct visited nodes reads

$N_{distinct} = \sum_{l=0}^{l_{max}} N_l \left( 1 - (1 - 1/N_l)^{n_{RW}} \right)$ . (4)

The expression (4) lends itself to numerical investigation using various forms for the growth of $N_l$ as a function
of $l$. We obtain (not shown) that, as $n_{RW}$ increases, $N_{distinct}$ increases, with an approximate power-law form, and
saturates as $n_{RW} \to \infty$ at the total number of reachable nodes $\sum_{l=0}^{l_{max}} N_l$. Moreover, the increase at low $n_{RW}$ is
sub-linear if $N_l$ grows fast enough with $l$ (at least ~ /^), and is closer to linear if $l_{max}$ increases.
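
A numerical investigation of Eq. (4) of the kind mentioned above can be sketched as follows. The ring sizes $N_l = 3^l$ with $l_{max} = 5$ are a hypothetical choice made only to show the saturation at the total number of reachable nodes; the function name is likewise an assumption.

```python
def distinct_fixed_length(ring_sizes, n_rw):
    """Eq. (4): expected number of distinct nodes for fixed-length walks.

    ring_sizes[l] is N_l for l = 0..l_max; each walk is assumed to visit
    exactly one uniformly chosen node in every ring.
    """
    return sum(n_l * (1.0 - (1.0 - 1.0 / n_l) ** n_rw) for n_l in ring_sizes)


# Hypothetical tree-like growth N_l = 3^l, with l_max = 5.
ring_sizes = [3 ** l for l in range(6)]
total_reachable = sum(ring_sizes)
for n_rw in (1, 10, 100, 1000, 10 ** 6):
    print(n_rw, round(distinct_fixed_length(ring_sizes, n_rw), 2), total_reachable)
```

A single walk visits one node per ring, and the count cannot exceed the sum of the ring sizes, which gives two simple sanity checks for any choice of $N_l$.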

B. Random walks of random lengths

Empirical evidence on the distribution of post lengths (Fig. 2) suggests considering random walks of random lengths,
distributed according to a broad law. Let us therefore now consider, under the same assumptions, that the successive
random walks have randomly distributed lengths according to a certain $P(l)$. Each ring $l$, on average, is then reached
by a random walk $n_{RW} \times \sum_{l' \ge l} P(l') = n_{RW} P_>(l)$ times, so that we have approximately

$N_{distinct} = \sum_{l=0}^{\infty} N_l \left( 1 - (1 - 1/N_l)^{n_{RW} P_>(l)} \right)$ , (5)

where the sum (provided it converges) now runs over all possible lengths.

If $P(l)$ is narrowly distributed around an average value, the form (5) will not differ very much from the case of fixed
length given by Eq. (4). Conversely, for a broad $P(l)$, longer random walks will occur as $n_{RW}$ increases and the tail
of $P(l)$ is sampled, allowing visits to nodes situated further from $i_0$ and avoiding the saturation effect observed for
random walks of fixed length.

In some particular cases, a further analytical insight into the form of $N_{distinct}(n_{RW})$ can be obtained:

• assume that $N_l \sim l^a$ and that $P(l)$ is power-law distributed ($P(l) \sim 1/l^b$). Then
$N_{distinct}(n_{RW}) \sim \sum_{l=0}^{\infty} l^a \left( 1 - \exp(-n_{RW}/(c\, l^{a+b-1})) \right)$, where $c$ is a constant. The terms in the sum become negligible for $l$
larger than $n_{RW}^{1/(a+b-1)}$, while they are close to $l^a$ for smaller values of $l$. The sum therefore behaves as

$N_{distinct} \sim n_{RW}^{(a+1)/(a+b-1)}$ , (6)

i.e. a power-law. For instance, for $b = 3$ we obtain a sub-linear power-law growth with exponent $(a + 1)/(a + 2)$,
i.e. 2/3 for $a = 1$, or 3/4 for $a = 2$.

• assume that $N_l \sim z^l$, which corresponds to a tree in which each node has $z + 1$ neighbours, and $P(l) \sim 1/l^b$.
Then $N_{distinct}(n_{RW}) \sim \sum_{l=0}^{\infty} z^l \left( 1 - \exp(-n_{RW}/(c\, z^l l^{b-1})) \right)$. As in the previous case, the terms in the sum
become negligible for $l$ larger than $(\log(n_{RW}) - (b - 1) \log(\log(n_{RW})/\log(z)))/\log(z)$, while they are close to $z^l$
for smaller $l$. Thus the sum behaves as

$N_{distinct} \sim n_{RW} / (\log(n_{RW}))^{b-1}$ , (7)

i.e. we obtain a linear behaviour with logarithmic corrections, which is known to be very similar to sub-linear
power-law behaviours.

We have thus shown analytically, under reasonable assumptions, that performing fixed length random walks starting 
from the same node yields a growth of the number of distinct visited sites (representing the vocabulary size) as a 
function of the number of random walks (representing posts) which is sub-linear with a saturation effect, and that 
broad distributions of the walks lengths lead to sub-linear growths of the vocabulary, and avoid the saturation effect. 
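
The power-law case can be checked numerically by evaluating Eq. (5) directly. The Python sketch below does this for $N_l = l^a$ and $P(l) \propto l^{-b}$ and compares the measured growth exponent with $(a+1)/(a+b-1)$. Several choices are assumptions made for illustration only: the rings are labelled $l = 1, \ldots, l_{cut}$ (the origin is omitted), the length distribution is truncated and normalized at a hypothetical cutoff `l_cut`, and the exponent is estimated from two values of $n_{RW}$.

```python
import math


def distinct_random_lengths(ring_sizes, length_probs, n_rw):
    """Eq. (5) for rings l = 1..L.

    ring_sizes[l-1] is N_l and length_probs[l-1] is P(l).
    """
    # tail[l-1] = sum of P(l') over l' >= l, i.e. the fraction of walks
    # long enough to reach ring l
    tail, running = [], 0.0
    for p in reversed(length_probs):
        running += p
        tail.append(running)
    tail.reverse()
    return sum(n_l * (1.0 - (1.0 - 1.0 / n_l) ** (n_rw * t))
               for n_l, t in zip(ring_sizes, tail))


def power_law_setup(a, b, l_cut):
    """Ring sizes N_l = l^a and normalized P(l) ~ l^(-b) for l = 1..l_cut."""
    lengths = range(1, l_cut + 1)
    ring_sizes = [float(l) ** a for l in lengths]
    raw = [float(l) ** (-b) for l in lengths]
    norm = sum(raw)
    return ring_sizes, [r / norm for r in raw]


a, b = 2, 3
ring_sizes, length_probs = power_law_setup(a, b, l_cut=2000)
n1, n2 = 10 ** 4, 10 ** 6
d1 = distinct_random_lengths(ring_sizes, length_probs, n1)
d2 = distinct_random_lengths(ring_sizes, length_probs, n2)
measured = (math.log(d2) - math.log(d1)) / (math.log(n2) - math.log(n1))
predicted = (a + 1) / (a + b - 1)
print(measured, predicted)
```

The measured exponent is only expected to approach the predicted one when the cutoff is much larger than $n_{RW}^{1/(a+b-1)}$, so the agreement should be read as approximate.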

Figure 3 (top) shows a confirmation of the appearance of a sub-linear power-law-like growth of $N_{distinct}$, mimicking
the Heaps' law observed in tagging systems, for random walks performed on a Watts-Strogatz network.
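
A simulation of this kind can be sketched in plain Python as follows. Everything specific in the sketch is an assumption made for illustration: the network is much smaller than the one used for Fig. 3, the walk lengths are drawn from a truncated power law with a hypothetical exponent `b` and cutoff `l_cut`, and the origin is counted among the visited nodes.

```python
import random


def watts_strogatz(n, k, p, rng):
    """Ring of n nodes, each linked to its k nearest neighbours (k even);
    every lattice edge is rewired to a random endpoint with probability p."""
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in range(1, k // 2 + 1):
            adj[i].add((i + j) % n)
            adj[(i + j) % n].add(i)
    for i in range(n):
        for j in range(1, k // 2 + 1):
            old = (i + j) % n
            if old in adj[i] and rng.random() < p:
                new = rng.randrange(n)
                while new == i or new in adj[i]:
                    new = rng.randrange(n)
                adj[i].discard(old)
                adj[old].discard(i)
                adj[i].add(new)
                adj[new].add(i)
    return [sorted(neigh) for neigh in adj]


def sample_lengths(n_walks, b, l_cut, rng):
    """Walk lengths drawn from P(l) ~ l^(-b), l = 1..l_cut (truncated)."""
    lengths = list(range(1, l_cut + 1))
    weights = [float(l) ** (-b) for l in lengths]
    return rng.choices(lengths, weights=weights, k=n_walks)


def distinct_nodes_growth(adj, start, lengths, rng):
    """Number of distinct visited nodes after each successive walk."""
    visited = {start}
    growth = []
    for length in lengths:
        node = start
        for _ in range(length):
            node = rng.choice(adj[node])
            visited.add(node)
        growth.append(len(visited))
    return growth


rng = random.Random(0)
graph = watts_strogatz(n=2000, k=8, p=0.1, rng=rng)
lengths = sample_lengths(n_walks=2000, b=3, l_cut=100, rng=rng)
growth = distinct_nodes_growth(graph, start=0, lengths=lengths, rng=rng)
print(growth[9], growth[99], growth[-1])
```

Plotting `growth` against the walk index on doubly logarithmic axes gives the analogue of the vocabulary growth curve.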

IV. SYNTHETIC CO-OCCURRENCE NETWORKS 

Vocabulary growth is only one aspect of the dynamics of tagging systems. Networks of co-occurrence carry much
more detailed signatures that present very specific features (Fig. 1). Interestingly, our approach allows us to construct
synthetic co-occurrence networks: we associate to each random walk a clique formed by the nodes visited (see Fig. 2),
and consider the union of the $n_{RW}$ such cliques. Moreover, each link $i, j$ built in this way receives a weight equal to the
number of times nodes $i$ and $j$ appear together in a random walk. This construction mimics precisely the way
the empirical co-occurrence network is obtained, and also reflects the idea that tags that are far apart in the underlying
semantic network are visited together less often than tags which are semantically closer. Figures 3 and 4 show how
the synthetic networks reproduce all statistical characteristics of the empirical data (Fig. 1), both topological and
weighted, including highly non-trivial correlations between topology and weights. Figure 4 in particular explores how
the weight $w_{ij}$ of a link is correlated with its extremities' degrees $k_i$ and $k_j$. The peculiar shape of the curve can
be understood within our framework. First, the broad distribution in $l$ is responsible for the plateau $w_{ij} \simeq 1$ at small
values of $k_i k_j$, since it corresponds to long RWs that occur rarely and visit nodes that will be typically reached a very
small number of times (hence small weights). Moreover, $w_{ij} \sim (k_i k_j)^{a}$ at large weights. Denoting by $f_i$ the number
of times node $i$ is visited, $w_{ij} \sim f_i f_j$ in a mean-field approximation that neglects correlations. On the other hand, $k_i$
is by definition the number of distinct nodes visited together with node $i$. Restricting the random walks to the only
processes that visit $i$, it is reasonable to assume that such sampling preserves Heaps' law, so that $k_i \propto f_i^{\alpha}$, where $\alpha$
is the growth exponent for the global process. This leads to $w_{ij} \sim (k_i k_j)^{a}$ with $a = 1/\alpha$. Since $\alpha \simeq 0.7$ to $0.8$, we obtain
$a$ close to 1.3 to 1.5, consistently with the numerics.
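
The clique construction and the weight versus degree-product analysis can be sketched as below. The representation of a walk as a list of visited nodes, the function names and the toy walks are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def synthetic_cooccurrence(walks):
    """Union of the cliques formed by the nodes visited in each walk.

    walks: iterable of node sequences. The weight of link (i, j) is the
    number of walks in which i and j both appear.
    """
    weights = Counter()
    for walk in walks:
        for i, j in combinations(sorted(set(walk)), 2):
            weights[(i, j)] += 1
    return weights


def weight_vs_degree_product(weights):
    """Average link weight as a function of the product k_i * k_j."""
    degree = Counter()
    for i, j in weights:
        degree[i] += 1
        degree[j] += 1
    sums, counts = defaultdict(float), defaultdict(int)
    for (i, j), w in weights.items():
        product = degree[i] * degree[j]
        sums[product] += w
        counts[product] += 1
    return {product: sums[product] / counts[product] for product in sums}


# Hypothetical toy walks, all starting from node 0.
walks = [[0, 1, 2], [0, 1], [0, 3]]
weights = synthetic_cooccurrence(walks)
print(weight_vs_degree_product(weights))
```

On real or simulated data, the returned averages would typically be binned logarithmically before being compared with a power law in $k_i k_j$.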

Strikingly, the synthetic co-occurrence networks reproduce other, more subtle observables, such as the distribution 
of cosine similarities between nodes. In a weighted network, the similarity of two nodes $i_1$ and $i_2$ can be defined as

$\mathrm{sim}(i_1, i_2) = \frac{\sum_j w_{i_1 j} w_{i_2 j}}{\sqrt{\sum_j w_{i_1 j}^2} \sqrt{\sum_j w_{i_2 j}^2}}$ , (8)

which is the scalar product of the vectors of normalized weights of nodes $i_1$ and $i_2$. This quantity, which measures
the similarities between neighbourhoods of nodes, contains non-trivial semantic information that can be used to 
detect synonymy relations between tags, or to uncover "concepts" from social annotations [1^. Figure [5] shows the 
histograms of pair-wise similarities between nodes in real and synthetic co-occurrence networks. The distributions are 
very similar, with a skewed behaviour and a peak for low values of the similarities. 
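
The cosine similarity of Eq. (8) can be computed from the weighted network as in the following Python sketch. The dictionary-based storage and the function names are illustrative assumptions, and both nodes are assumed to have at least one link so that the norms are non-zero.

```python
import math
from collections import defaultdict


def weight_vectors(weights):
    """Map each node to its vector of link weights, stored as a dict."""
    vectors = defaultdict(dict)
    for (a, b), w in weights.items():
        vectors[a][b] = w
        vectors[b][a] = w
    return dict(vectors)


def cosine_similarity(vectors, i1, i2):
    """Eq. (8): scalar product of the normalized weight vectors."""
    v1, v2 = vectors[i1], vectors[i2]
    dot = sum(w * v2[j] for j, w in v1.items() if j in v2)
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2)


# Hypothetical toy network: a and b share the neighbour c.
toy = {("a", "c"): 1, ("a", "d"): 1, ("b", "c"): 1}
vectors = weight_vectors(toy)
print(cosine_similarity(vectors, "a", "b"))
```

Two nodes with proportional weight vectors have similarity 1, and two nodes with no common neighbour have similarity 0.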

While the data shown in Figs. 3 and 4 correspond to a particular example of underlying network (a Watts-Strogatz
network, see [2^) taken as a cartoon for the semantic space, we have also investigated the dependence of the synthetic 
network properties on the structure of the semantic space and on the other parameters, such as $n_{RW}$ or the distribution
of the random walk lengths. Interestingly, we find an overall extremely robust behavior for the diverse synthetic
networks, showing that the proposed mechanism reproduces the empirical data without any need for strong hypotheses
on the semantic space structure. The only general constraints we can fix on our proposed mechanism are the existence 
of an underlying semantic graph with a small diameter and a finite average degree (random walks on a fully connected 
graph would not work, for instance) and a broad distribution of post lengths. This lack of strong constraints on the 
precise structure of the underlying semantic network is actually a remarkable feature of the proposed mechanism. 
The details of the underlying network will unavoidably depend on the context, namely on the specific choice of the 
central tag $t^*$, and the robustness of the generative model matches the robustness of the features observed in co-occurrence
networks from real systems. Of course, given an empirical co-occurrence network, a careful simultaneous
fitting procedure of the various observables would be needed to choose the most general class of semantic network 
structures that generate that specific network by means of the mechanism introduced here. This delicate issue goes 
beyond the goal of this paper, and also raises the open question of the definition of the minimal set of statistical 
observables needed to specify a graph [30].

V. CONCLUSIONS 

Investigating the interplay of human and technological factors in user-driven systems is crucial to understand the 
evolution and the potential impact these techno-social systems will have on our societies. Here we have shown that 
sophisticated features of the information networks stemming from social annotations can be captured by regarding the 
process of social annotation as a collective exploration of a semantic space, modeled as a graph, by means of a series 
of random walks. The proposed generative mechanism naturally yields an explanation for the Heaps' law observed for 
the growth of tag vocabularies. The properties of the co-occurrence networks generated by this mechanism are robust 
with respect to the details of the underlying graph, provided it has a small diameter and a small average degree. This 
mirrors the robustness of the stylized facts observed in the experimental data, across different systems. 

Networks of resources, users, and metadata such as tags have become a central collective artifact of the information 
society. These networks expose aspects of semantics and of human dynamics, and are situated at the core of innovative 
applications. Because of their novelty, research about their structure and evolution has been mostly confined to 
applicative contexts. The results presented here are a definite step towards a fundamental understanding of user-driven 
information networks that can prompt interesting developments, as they involve the application of recently-developed 
tools from complex networks theory to this new domain. An open problem, for instance, is the generalization of 
our modeling approach to the case of the full hyper-graph of social annotations, of which the co-occurrence network
is a projection. Moreover, user-driven information networks lend themselves to the investigation of the interplay 
between social behavior and semantics, with theoretical and applicative outcomes such as node ranking (i.e., for
search and recommendation), detection of non-social behavior (such as spam), and the development of algorithms to
learn semantic relations from large-scale datasets of social annotations.

FIG. 1: Data corresponding to the posts containing the tag "Folksonomy" in del.icio.us. Top: Heaps' law: growth of the
vocabulary size associated with the tag $t^*$ = "Folksonomy", measured as the number of distinct tags co-occurring with $t^*$, as a
function of the number $n_{posts}$ of posts containing $t^*$. The dotted line corresponds to a linear growth law while the continuous
line is a power-law growth with exponent 0.7. Inset: Frequency-rank plot of the tags. The dashed line corresponds to a power-law
with exponent $-1.42 \simeq -1/0.7$. Middle and Bottom: Main properties of the co-occurrence network of the tags co-occurring with the tag
"Folksonomy" in del.icio.us, built as described in the main text. Middle figure: Broad distributions of degrees $k$, strengths $s$
and weights $w$ are observed. The inset shows the average strength of nodes of degree $k$, with a superlinear growth at large
$k$. Bottom figure: Weighted ($k^w_{nn}$) and unweighted ($k_{nn}$) average degree of nearest neighbors (top), and weighted ($C^w$) and
unweighted ($C$) average clustering coefficients of nodes of degree $k$. $k_{nn}$ displays a disassortative trend, and a strong clustering
is observed. At small $k$, the weights are close to 1 ($s(k) \sim k$, see inset of middle figure), and $k^w_{nn} \sim k_{nn}$, $C^w \sim C$. At large $k$
instead, $k^w_{nn} > k_{nn}$ and $C^w > C$, showing that large weights are preferentially connecting nodes with large degree: large degree
nodes are joined by links of large weight, i.e. they co-occur frequently together. In (B) and (C) both raw and logarithmically
binned data are shown.

FIG. 2: Left: Illustration of the proposed mechanism of social annotation. The semantic space is pictured as a network in
which nodes represent tags and a link corresponds to the possibility of a semantic association between tags. A post is then 
represented as a random walk on the network. Successive random walks starting from the same node allow the exploration of 
the network associated with a tag (here pictured as node 1). The artificial co-occurrence network is built by creating a clique 
between all nodes visited by a random walk. Right: empirical distribution of posts' lengths $P(l)$. A power-law decay
(dashed line) is observed.

FIG. 3: Synthetic data produced through the proposed mechanism. Top: Growth of the number of distinct visited sites as
a function of the number of random walks performed on a Watts-Strogatz network of size 5 ■ 10'* nodes and average degree 8, 
rewiring probability $p = 0.1$. Each random walk has a random length $l$ taken from a distribution $P(l)$. The dotted line
corresponds to a linear growth law while the continuous line is a power-law growth with exponent 0.7. Inset: Frequency-rank 
plot. The continuous and dashed line have slope —1.3 and —1.5, respectively. Middle and bottom: Properties of the synthetic 
co-occurrence network obtained for urw ~ 5 • 10^, to be compared with the empirical data of Fig. [T] 

FIG. 4: Correlations between the weights of the links in the co-occurrence networks and the degrees of the links' endpoints, as
measured by plotting the weight $w_{ij}$ of a link $i,j$ versus the product of the degrees $k_i k_j$. Top: co-occurrence network of the
tag "Folksonomy" of del.icio.us; each green dot corresponds to a link; the black circles represent the average over all links $i,j$
with given product kikj. Bottom: synthetic co-occurrence networks obtained from urw = 5 ■ 10* random walks performed on 
a Watts-Strogatz network of 10"" nodes. The black circles correspond to random walks of random lengths distributed according 
to $P(l)$ and the red crosses to fixed length random walks ($l = 5$).

FIG. 5: Distributions of cosine similarities for real (top) and synthetic (bottom) co-occurrence networks. For del.icio.us, the
tag number represents its popularity rank in the database. For the synthetic co-occurrence networks, the different curves 
correspond to different underlying networks on which the random walks are performed. 



